library(ggplot2)

560

Problem Set

Due Date: Thursday, May 10

David Bacsik

Problem 1

In homework 2, you were investigating the de novo mutation rate of R. Waterstonii. In a given generation, you observed the following mutation counts in 12 different isolates:

mutations = c(12,15,10,4,6,15,18,12,7,11,5,14)

mutation_data = data.frame(mutations)

mutation_data

1a

Obtain 1,000 bootstrap samples from your data and recompute λ* for these bootstrap examples.

I will use the boot package to randomly resample this data 1000 times. Each time, I will compute lambda as the mean of the sample. The resulting distribtuion is:

library(boot)

set.seed(1)

samplemean = function(x, d) {
  return(mean(x[d]))
}

lambda_ests = boot(data = mutation_data$mutations, statistic = samplemean, R = 1000)

lambda_ests.vals = data.frame(t = lambda_ests$t[1:1000])

print(mean(lambda_ests.vals$t))
## [1] 10.69767
ggplot(lambda_ests.vals, aes(x=t)) + geom_histogram() +
   labs(title="Lambda estimate", subtitle='1000 bootstrap samplings', y="count", x = "lambda")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1b

**Compute the variance of λ*.**

var(lambda_ests$t[1:1000])
## [1] 1.493964

1c

Generate a 95% confidence interval for λ* using the normal approximation method. The lambda estimates should be approximatley normally distributed, because they are a stastic calculated over many random samples.

The edges of the 95% confidence interval of the normal approximation of this distribution are:

m = mean(lambda_ests$t[1:1000])
v = var(lambda_ests$t[1:1000])

qnorm(0.025, mean = m, sd = sqrt(v))
## [1] 8.302046
qnorm((1-0.025), mean = m, sd = sqrt(v))
## [1] 13.09329

1d

Generate a 95% confidence interval for λ* using the percentile method.

I want to find the values that lie, within our statistic distribution, at the edge of the 95% confidence interval.

This is simply the 25th (0.025%) and 975th (97.5%) term in the ordered list of the 1000 lambda estimates.

lambda_ests.sorted = sort(lambda_ests$t[1:1000])

lambda_ests.sorted[25]
## [1] 8.333333
lambda_ests.sorted[(1000-25)]
## [1] 13.08333

1e

Compare the results from c and d. Which do you prefer, and why?

In this case, I prefer the percentile approach. While the values are similar between the normal approximation and percentile approach, since the percential approach was computationally trivial, I prefer to use values that are present in the bootstrap set to describe the bootstrap set, rather than the inferred values of the normal distribution.

2

You are studying the effect of a new enzyme on the number of PCR cycles required for a reaction. You split each of your 16 samples into two and perform the reactions with the old and new enzymes. The following table summarizes your data on the number of cycles required for each pair of reactions using the old and new enzymes:

old = c(7,5,7,7,10,12,5,13,11,9,9,8,8,13,6,10)
new = c(6,8,8,2,9,9,7,7,4,7,8,7,7,11,8,6)

enzyme_data = data.frame(old, new)

enzyme_data

Approach A: Paired T-Test

2A - a

Briefly describe (in words) your strategy. Why did you choose this approach?

These data are paired. Each observation is from a single sample tested with two different conditions (old vs new enzyme). I will use a paired t-test to evaluate whether the conditions affect the observations.

2A - b

What are the null and alterative hypotheses for this test

The null hypothesis for this test is that the mean difference between pairs is 0. The alternative hypothesis is that the mean difference between pairs is not 0.

2A - c

What assumptions does your approach make? Are they all satisfied? How do you know?

-The data are continuous; this is true, assuming we know our measurment technique allows continous observations.
-Each (paired) observation is independent; this is also true, assuming we know our sampling method is not biased.
-The data are approximately normally distributed; by eye, this is true for the ‘new’ data, and it is unclear whether it is true for the ‘old’ data.

old_plot = ggplot(enzyme_data, aes(x=old)) + geom_bar()
new_plot = ggplot(enzyme_data, aes(x=new)) + geom_bar()

old_plot

new_plot

2A - d

What is the p-value from your analysis?

paired_enzyme_test = t.test(enzyme_data$old, enzyme_data$new, paired = TRUE)

paired_enzyme_test$p.value
## [1] 0.03890274

2A - e

What do you conclude (be precise about the conclusion)?

Because the p value is less than 0.05, I conclude that the mean of differences between the pairs is not 0. Because our experimental design only changed the enzyme in the reaction, we can infer causality. I infer that changing the enzyme produced a difference in the number of cycles required for each pair of reactions. Because this was a two-sided test, I cannot conclude the direction or magnitude of the difference, only that a difference is present.

Approach B: Wilcoxan Signed Rank Test

2B - a

Briefly describe (in words) your strategy. Why did you choose this approach? These data are paired. Each observation is from a single sample tested with two different conditions (old vs new enzyme). In the previous approach, paired t-test, I found that the data are not clearly normally distributed. Now, I will use a non-parametric test, the Wilcoxon Signed Rank Test, which does not require normally distributed data to be valid.

2B - b

What are the null and alterative hypotheses for this test The null hypothesis is that the populations the two samples are drawn from have the same median. The alternative hypothesis is that they are drawn from populations with different medians.

2B - c

What assumptions does your approach make? Are they all satisfied? How do you know?

-The data are continuous; this is true, assuming we know our measurment technique allows continous observations.
-Each (paired) observation is independent; this is also true, assuming we know our sampling method is not biased.
-The data have a symmmetric distribution; this is true for both ‘old’ and ‘new’ data sets.

2B - d

What is the p-value from your analysis?

paired_enzyme_wilcoxtest = wilcox.test(enzyme_data$old, enzyme_data$new, paired = TRUE)
## Warning in wilcox.test.default(enzyme_data$old, enzyme_data$new, paired =
## TRUE): cannot compute exact p-value with ties
paired_enzyme_wilcoxtest$p.value
## [1] 0.06432261

2B - e

What do you conclude (be precise about the conclusion)? The p-value given for the non-parametric test is 0.06, which is above the common threshold of 0.05. This test is less powerful than the t-test. However, this tests assumptions were met, while it is not clear that the t-test’s assumptions were met.

Given this value, I would think it is likely that the medians of the populations the old and new data are drawn from are different; because we know the experimental design, we can say that the median number of cycles needed to complete the reaction with the new enzyme is likely different from the median number needed to complete the reaction with the old enzyme. However, I would not be very confident in this conclusion.

3

3a

After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e. the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The good news is that this is a rare disease, striking only 1 in 10,000 people. Why is it good news that the disease is rare? What are the chances that you actually have the disease?

Given that you tested positive, the chance that you actually have the disease depends on the accuracy of the test (a more specific test increases the chance that you have the disease) and the chance that you have the disease in the absence of testing (the background rate of the disease). Since the background rate is very low, even though you tested positive on the test, it is still overwhelmingly likely that you DO NOT have the disease:

A = you have the disease
B = you tested positive

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]

In the absence of testing, P(A) is = 1/10,000.
Since the test is 99% sensitive, P(B|A) is 0.99.
P(B) is calculated as the probability that you test positive given that you have the disease–P(B|A)–times the probability that you have the disease–P(A)–plus the probability that you test positive given that you don’t have the disease–P(B|not A)–times the probability that you don’t have the disease–P(not A): \[ P(B) = P(B|A) \cdot P(A) + P(B|not A) \cdot P(not A) \]

The probability that you don’t have the disease–P(not A)–is 1-(1/10,000), or 9,999/10,000. The probability that you test positive for the test given that you don’t have the disease–P(B|not A)–is 1-0.99 (specificity), or 0.01.

Finally, we can calculate P(B):

\[ P(B) = 0.99 \cdot \frac{1}{10,000} + 0.01 \cdot \frac{9,999}{10,000} = 0.010098 \]

Therefore, the chance that you actually have the disease, now that you test positive is: \[ P(A|B) = \frac{0.99 * 0.0001}{0.010098} = 0.0098 \]

This is about 100X higher than baseline, but it is still >99% likely that you DO NOT have the disease.

3b

Prove the conditionalized version of the general product rule:

Prove the conditionalized version of Bayes’ rule:

4

Say that you developed a potential Nobel prize worthy method of counting spontaneous deleterious mutations. You applied it to ten individuals and obtained the following data:

deleterious_muts = c(4,3,0,1,9,2,3,8,0,6)

human_muts = data.frame(deleterious_muts)

human_muts

4a

Write out the likelihood function for this data. (Hint: we discussed in lecture #3 the probability distribution that this type of data follows.)

This data is spontaneous mutations per person, which is a rate. Rates follow a poission distribution, which is described the equation:

\[ P(X = i) = e^{-\lambda} \frac{\lambda^i}{i!} \]

The function has only one parameter, \(\lambda\). For estimating likelihood, we will use \(\theta\). The likelihood of the parameter, \(\theta\), is given by the probability of observing the data with that parameter:

\[ L(\theta : D) = P(D|\theta) \]

In our case, the probability of seeing each specific number (i) of deleterious mutations is P(X = i), and the joint probability of the data is just the product of each individual observation.

To generatize the likelihood function, there will a term with \(e^{-\theta}\) raised to n (the number of observations in the dataset):
\[(e^{-\theta})^{n}\]

There will also be a term with \(\theta\) raised to the sum of all mutations (i) observed:
\[\theta^{\sum_{i_1}^{i_n} i}\]

Finally, we will capture the factorial for each observation. This will be a term of \[\frac{1}{i_1! \cdot i_2! \cdot i_3! ... i_n! }\]

All together, the likelihood function will be: \[ L(\theta : D) = P(D|\theta) = (e^{-\theta})^{n} \cdot \theta^{\sum_{i_1}^{i_n} i} \cdot \frac{1}{i_1! \cdot i_2! \cdot i_3! ... i_n! } \]

In the case of our data, n = 10, the sum of all observations is 36, and the factorial term can be calculated.

\[ L(\theta : D) = P(D|\theta) = (e^{-\theta})^{10} \cdot \theta^{\sum_{i_1}^{i_n}i} \cdot \frac{1}{i_1! \cdot i_2! \cdot i_3! ... i_n! } \]

\[ L(\theta : D) = P(D|\theta) = (e^{-\theta})^{10} \cdot \theta^{36} \cdot \frac{1}{4! \cdot 3! \cdot 0! ... 6! } \]

\[ L(\theta : D) = P(D|\theta) = (e^{-\theta})^{10} \cdot \theta^{36} \cdot \frac{1}{18203705081856000} \]

4b

Calculate the MLE of the average number of spontaneous deleterious mutations per individual from the data above. You can do this in any of the three ways that we discussed in class (graphically, numerically or calculus).

I will do this graphically. First, I will define an r function (‘l.likelihood’) that calculates the log likelihood of observing our datset, based on the value of theta:

l.likelihood = function(t){
  return(log((exp(-t)^10)*(t^36)*(1/18203705081856000)))
}

Then, I will plot the likelihood for a range of parameter values. Since all of our data have values less than 10, I will plot the values from 0 to 50; you cannot have less than 0 spontaneous mutations.

param.vals = seq(0, 50, 0.01)
ll.ests = data.frame(param.vals, likelihood = l.likelihood(param.vals))

ll.plot = ggplot(data = ll.ests, aes(x=param.vals, y=likelihood)) + geom_point() +
   labs(title="Maximum Likelihood Estimation", y="log likelihood", x = "average spontanenous mutations per genome")

ll.plot

From this plot, it is clear that the maximum likelihood falls somewhere around (but below) 5 mutations per genome. We can use a bit of r to find the mutation rate where the max of the plot falls:

max(ll.ests$likelihood)
## [1] -27.32678
which.max(ll.ests$likelihood)
## [1] 361
ll.ests$param.vals[which.max(ll.ests$likelihood)]
## [1] 3.6

From this, I find a maximum likelhood estimate of 3.6.

4c

Plot the log-likelihood of the data as a function of the parameter.

See plot (above).

5

You are interested in the transcriptional changes during early stages of the innate immune response. You obtain lymphoblast cell lines from 10 individuals and for each one measure expression levels at baseline (untreated) and following treatment with the drug immiquimod (which is a TLR8 agonist). The following table shows gene expression levels for a particular transcript.

baseline = c(-0.24,0.25,1.12,-0.06,0.46,0.17,0.02,1.10,0.55,0.98)
stimulated = c(1.74,2.1,1.65,2.65,3.11,2.31,1.87,3.21,2.19,1.75)

lympho_data = data.frame(baseline, stimulated)

lympho_data

5a

Perform a one sample t-test to test the hypothesis that baseline expression levels are significantly different than zero. Clearly state the null and alternative hypotheses and submit R code, test statistic value and p-value.

The null hypothesis is that the mean of the baseline values is equal to zero. The alternative hypothesis is that the mean is not equal to zero.

baseline_zero = t.test(lympho_data$baseline, mu = 0)

baseline_zero$statistic
##        t 
## 2.782491
baseline_zero$p.value
## [1] 0.02131382

The t statistic is 2.78. The p-value is 0.02. If I wanted to take a conservative interpretation, I could set my cutoff at 0.01. Since 0.02 in not less than 0.01, I would not reject the null hypothesis.

5b

Use a paired t-test to test the hypothesis that gene expression levels are significantly different between baseline and stimulated conditions. Again, clearly state the null and alternative hypotheses and submit R code, test statistic value and p-value.

The null hypothesis is that the mean difference between pairs is 0. The alternative is that the mean difference between pairs is not 0.

stim_base = t.test(lympho_data$baseline, lympho_data$stimulated, paired=TRUE)

stim_base$statistic
##         t 
## -8.158317
stim_base$p.value
## [1] 1.892061e-05

The test statistic is -8.158317 and the p-value is 1.89 e-5. This is much less than 0.01, so I reject the null hypothesis and conclude that the mean difference between pairs is not 0.

5c

Formally, let x_i and y_i denote the expression level for the i − th individual in baseline and stimulated conditions, respectively. Then define z_i = y_i − x_i. Perform a one sample t-test on the vector of z_i values. Clearly state the null and alternative hypotheses and submit R code, test statistic value, and p-value. How does your result compare to that obtained from part b?

lympho_data$diff = lympho_data$stimulated - lympho_data$baseline
lympho_data

The null hypothesis is that the mean of the difference (lympho_data$diff) between each pair is 0. The alternative hypothesis is that it is not zero.

diff_zero = t.test(lympho_data$diff, mu = 0)

diff_zero$statistic
##        t 
## 8.158317
diff_zero$p.value
## [1] 1.892061e-05

The test statistic is 8.158317, which is the same value but the opposite sign as the paired t-test. The p-value is 1.89e-5, which is the same as the paired t-test; this is simply a matter of which value was subtracted. In this case, as well, I am confident in rejecting the null hypothesis that the mean difference between pairs is 0.